A Study of Real World I/O Performance in Parallel Scientific Computing
Authors: D. Kimpe et al.
Abstract
Parallel computing is indisputably present in the future of high performance computing. For distributed memory systems, MPI is widely accepted as a de facto standard. However, I/O is often neglected when considering parallel performance. In this article, a number of I/O strategies for distributed memory systems will be examined. These will be evaluated in the context of COOLFluiD, a framework for object oriented computational fluid dynamics. The influence of the system and software architecture on performance will be studied. Benchmark results will be provided, enabling a comparison between some commonly used parallel file systems.

1 Motivation and Problem Description

1.1 Parallel Programming

Numerical simulation and other computationally intensive problems are often successfully tackled using parallel computing. Frequently these problems are too large to solve on a single system, or the time needed to complete them makes single-CPU calculation impractical. Successful parallelisation is usually measured by the problem "speedup": how much faster a given problem is solved on multiple processors, compared to the solution time on one processor. More often than not, this speedup is based only on the computationally intensive part of the code, and phases such as program startup or data loading and saving are excluded from the measurement. Also, when the ratio of computation to input data is high enough, I/O time is negligible in the total execution time. However, when scaling to larger problem sizes (and consequently more processors), I/O often becomes an increasingly large bottleneck. The main reason is that without parallel I/O, the I/O and calculation potential of a cluster quickly become unbalanced. This is visible both in hardware and in software; often there is but a single file server managing data for the whole cluster. Moreover, traditional I/O semantics do not offer enough expressive power to coordinate requests, leading to file server congestion and reducing the already limited I/O bandwidth even further.

1.2 Computational Fluid Dynamics and COOLFluiD

Computational fluid dynamics (CFD) deals with the solution of a system of partial differential equations describing the motion of a fluid. This is commonly done by discretizing these equations on a mesh. Depending on the numerical algorithm, a set of unknowns is associated with either the nodes or the cells of the mesh. The amount of computational work is proportional to the number of cells, and for realistic problems this quickly leads to simulations larger than a single system can handle.

COOLFluiD [4] is an object oriented framework for computational fluid dynamics, written in C++. It supports distributed memory parallelisation through MPI, but still allows optimized compilation without MPI for single-processor systems. COOLFluiD utilises parallel I/O for two reasons: to guarantee scalability of the code, and to hide parallelisation from the end user. During development, a goal was set to mask the differences between serial and parallel builds of COOLFluiD as much as possible. This, among other things, requires that the data files used and generated by the parallel version do not differ from those of the serial version. This depends on parallel I/O, as opening a remote file for writing on multiple processors using POSIX semantics is ill defined and often leads to corrupted files.
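As a concrete illustration of what such parallel I/O looks like in practice, the following minimal sketch (not taken from COOLFluiD; the file name, the number of states per process and the contiguous decomposition are assumptions made for the example) lets every MPI rank write its own block of solution values into a single shared file through the collective MPI-IO interface:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Illustrative: each rank owns an equally sized, contiguous block of states.
    const int nLocalStates = 1000;
    std::vector<double> states(nLocalStates, static_cast<double>(rank));

    // Offset of this rank's block in the shared file.
    MPI_Offset offset = static_cast<MPI_Offset>(rank) * nLocalStates * sizeof(double);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "solution.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Collective write: the MPI-IO layer sees all requests at once and can
    // coordinate and aggregate them, instead of congesting the file server.
    MPI_File_write_at_all(fh, offset, states.data(), nLocalStates,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

The collective call gives the MPI-IO implementation a global view of all outstanding requests, which is precisely the coordination that concurrent POSIX writes to a shared file cannot express.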
1.3 I/O in a Parallel Simulation

There has been much research on the optimal parallel solution of a system of PDEs. However, relatively little study has been devoted to creating scalable I/O algorithms for this class of problems. Generally speaking, there are three reasons for performing I/O during a simulation. At the start of the program, the mesh (its geometric description and an initial value for each of the associated unknowns) needs to be loaded into memory. During the computation, snapshots of the current solution state are stored. Before ending the program, the final solution is saved.

In a distributed memory machine, the mesh is divided between the nodes. Consequently, each CPU requires a different portion of the mesh to operate on. This offers opportunities for parallel I/O, since every processor only accesses distinct parts of the mesh. Figure 1 shows an example of a typical decomposition and the resulting I/O access pattern: on the left, the partitioned mesh; on the right, the file layout (row-major ordering), where colour indicates which states are accessed by a given CPU.

Fig. 1. Decomposition and file access pattern of a 3D sphere

2 I/O Strategies

Within COOLFluiD, I/O is fully abstracted. This simplifies supporting multiple file formats and access APIs, and allows run-time selection of the desired format. Mesh input and output is provided by file plugins. A file plugin offers a well defined, format independent interface to the stored mesh, and can implement any of the following access strategies (an illustrative interface sketch follows the descriptions below).

Parallel Random Access: This strategy has the potential to offer the highest performance. It allows every processor to read and write arbitrary regions of the file. If the system architecture provides multiple pathways to the file, this can be exploited. File plugins implementing this interface enable all CPUs to concurrently access those portions of the mesh required for their calculations.

Non-Parallel Random Access: In this model, the underlying file format (or access API) does not support parallel access to the file. Only a single CPU is allowed to open the file, which is then randomly accessible. This strategy can be used with data residing on non-shared resources, for example local disks.

Non-Parallel Sequential Access: Sometimes the way data is stored prohibits meaningful true parallel access. For example, within an ASCII based file format it is not possible to read a specific mesh element without first reading all preceding elements, because the stride between elements varies. As such, even when the OS and API allow parallel writing to the file, for mesh based applications this cannot be done without corrupting the file structure. Note that applications that do not care about the relative ordering of the entries in the file can still use parallel I/O to read and write it (using shared file pointer techniques). However, as this article studies I/O patterns for mesh based applications, this is not taken into consideration.
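To make these strategies concrete, the sketch below shows what a format independent file plugin interface could look like. This is an illustration only, not the actual COOLFluiD API; the names MeshFilePlugin, AccessStrategy and readStates are invented for the example.

#include <cstddef>
#include <string>

// Which access strategy a file format (or its access API) can support.
enum class AccessStrategy {
    ParallelRandom,        // all CPUs may read/write arbitrary regions concurrently
    NonParallelRandom,     // a single CPU opens the file; random access is possible
    NonParallelSequential  // a single CPU; entries must be read in file order
};

// Format independent view of a stored mesh, as a plugin could expose it.
class MeshFilePlugin {
public:
    virtual ~MeshFilePlugin() = default;

    // Strategy implemented by this plugin, so the framework can pick the
    // appropriate (parallel or single-reader) code path at run time.
    virtual AccessStrategy strategy() const = 0;

    virtual void open(const std::string& path) = 0;

    // Read `count` states starting at global index `first` into `buffer`.
    // A NonParallelSequential format may have to skip over all preceding
    // entries here, since the stride between ASCII entries varies.
    virtual void readStates(std::size_t first, std::size_t count,
                            double* buffer) = 0;

    virtual void close() = 0;
};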
3 Performance Testing

Currently, obtaining good parallel I/O performance is still somewhat of a black art. By making use of the flexibility COOLFluiD offers concerning mesh I/O, an attempt is made to explore and analyse the many different combinations of file system, API and interconnect that can be found in modern clusters.
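One straightforward way to compare such combinations is to time a collective write issued by all ranks and report the aggregate bandwidth. The sketch below is hypothetical (the 64 MiB per-rank block, the file name and the single write are arbitrary simplifications; a real benchmark would vary access patterns and repeat measurements), but it illustrates the basic measurement with MPI-IO:

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const std::size_t blockBytes = 64UL * 1024 * 1024;  // 64 MiB per rank (illustrative)
    std::vector<char> buffer(blockBytes, 'x');

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bench.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);            // start all ranks together
    double t0 = MPI_Wtime();

    MPI_Offset offset = static_cast<MPI_Offset>(rank) * blockBytes;
    MPI_File_write_at_all(fh, offset, buffer.data(),
                          static_cast<int>(blockBytes), MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_Barrier(MPI_COMM_WORLD);            // wait for the slowest rank
    double t1 = MPI_Wtime();

    MPI_File_close(&fh);

    if (rank == 0) {
        double gib = static_cast<double>(blockBytes) * nprocs / (1024.0 * 1024.0 * 1024.0);
        std::printf("aggregate write bandwidth: %.2f GiB/s\n", gib / (t1 - t0));
    }

    MPI_Finalize();
    return 0;
}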
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید